[SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization #29283
Force-pushed from 8ed454a to ab5ecde.
val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
assert(outputTypes == actualDataTypes, "Invalid schema from gapply(): " +
  s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
batch.rowIterator().asScala
This is the same as dapply:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
Lines 247 to 251 in 17586f9
columnarBatchIter.flatMap { batch =>
  val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
  assert(outputTypes == actualDataTypes, "Invalid schema from dapply(): " +
    s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
  batch.rowIterator.asScala
@viirya can you take a quick look when you're available?
Test build #126766 has finished for PR 29283 at commit
Test build #126767 has finished for PR 29283 at commit
viirya left a comment:
The improved error message looks nice. Just one minor comment about the doc.
LGTM
Merged to master and branch-3.0. Thanks @viirya.
…pply with Arrow vectorization
### What changes were proposed in this pull request?
This PR proposes to:
1. Fix the error message shown when the declared output schema does not match the R DataFrame returned by the given function. For example,
```R
df <- createDataFrame(list(list(a=1L, b="2")))
count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
```
**Before:**
```
Error in handleErrors(returnStatus, conn) :
...
java.lang.UnsupportedOperationException
...
```
**After:**
```
Error in handleErrors(returnStatus, conn) :
...
java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
...
```
2. Update documentation about the schema matching for `gapply` and `dapply`.
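
The check behind the new message can be sketched outside Spark. The snippet below is a minimal illustration, not the code from this PR: `DataType`, `IntegerType`, and `StringType` are hypothetical stand-ins for Spark's type objects, and `SchemaCheck.checkSchema` is a made-up helper that follows the shape of the assertion in the diff (compare the declared types with the actual ones and name both on failure).

```scala
// Hypothetical stand-ins for Spark's data types; these are NOT the real Spark classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

object SchemaCheck {
  // Illustrative helper (an assumption, not part of Spark): fails with a message
  // listing both the expected and the actual column types when they differ.
  def checkSchema(api: String, expected: Seq[DataType], actual: Seq[DataType]): Unit =
    assert(expected == actual, s"Invalid schema from $api(): " +
      s"expected ${expected.mkString(", ")}, got ${actual.mkString(", ")}")

  // Runs the check and returns the failure message, if any.
  def failureMessage(api: String, expected: Seq[DataType], actual: Seq[DataType]): Option[String] =
    try {
      checkSchema(api, expected, actual)
      None
    } catch {
      case e: AssertionError => Some(e.getMessage)
    }
}
```

With the schema from the example above (`a int, b int` declared, but `b` returned as a string), `SchemaCheck.failureMessage("gapply", Seq(IntegerType, IntegerType), Seq(IntegerType, StringType))` yields a message that names both type lists, which is the information the old `UnsupportedOperationException` did not carry. Scala's `assert` prefixes the text with `assertion failed: `, which is why that prefix appears in the error.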
### Why are the changes needed?
To show which schema is not matched, and let users know what's going on.
### Does this PR introduce _any_ user-facing change?
Yes, the error message is updated as above, and the documentation is updated.
### How was this patch tested?
Manually tested, and unit tests were added.
Closes #29283 from HyukjinKwon/r-vectorized-error.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Test build #126798 has finished for PR 29283 at commit